Kriging Convolutional Networks
Gabriel Appleby*, Linfeng Liu*, Li-Ping Liu
Department of Computer Science, Tufts University
{Gabriel.Appleby, Linfeng.Liu, Liping.Liu}@tufts.edu
Abstract
Spatial interpolation is a class of estimation problems where
locations with known values are used to estimate values at
other locations, with an emphasis on harnessing spatial lo-
cality and trends. Traditional Kriging methods have strong
Gaussian assumptions, and as a result, often fail to capture
complexities within the data. Inspired by the recent progress
of graph neural networks, we introduce Kriging Convolu-
tional Networks (KCN), a method of combining advantages
of Graph Convolutional Networks (GCN) and Kriging. Com-
pared to standard GCNs, KCNs make direct use of neigh-
boring observations when generating predictions. KCNs also
contain the Kriging method as a specific configuration. We
further improve the model’s performance by adding atten-
tion. Empirically, we show that this model outperforms GCNs
and Kriging in several applications. The implementation of
KCN using PyTorch is publicly available at the GitHub repository¹:
https://github.com/tufts-ml/kcn-torch.
Spatial data is ubiquitous in a wide variety of fields such
as ecology (Fink et al. 2010), economics (Gao and Liu
2014), and meteorology (Xingjian et al. 2015). A common
task within these fields is to estimate values at target loca-
tions from nearby known values. Improving these estima-
tions should provide clear benefits for these applications. Es-
timation techniques tailored to spatial data must leverage the
fact that every data point is associated with a location. Most
importantly, these techniques should be able to capture the
spatial correlation among these locations.
In many fields, the most prevalent method for spatial data
modeling is kriging (Cressie 1991). The fundamental as-
sumption of kriging is that observations at locations are from
an underlying Gaussian process. After estimating the var-
iogram, which is essentially the strength of spatial correla-
tions between data points, kriging uses a linear interpolation
of observed values to predict the value at a new location. The
kriging prediction is the best linear unbiased estimator for spatial points given its Gaussian assumption. However, this assumption is quite restrictive, as data in many applications are not from a Gaussian distribution. For example, we will show in our experiments that this assumption leads to poor performance when estimating integer counts that contain a significant fraction of zeros.
*Equal contribution. The first two authors are listed in alphabetical order.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
¹This is a new implementation updated in 2023 using PyTorch, so the performance values differ slightly from those reported in the paper.
Researchers also use flexible machine learning algorithms
(Hengl et al. 2018) for spatial data modeling. Given the huge
success of Graph Neural Networks (GNNs) and the similarity between spatial data and graph data, researchers have started to apply GNNs (Wu et al. 2019) to spatial data (Li
et al. 2017; Yu, Yin, and Zhu 2017; Zhu and Liu 2018;
Yan et al. 2019). GNNs were first developed for explicit
graph data, but can model any data that can be transformed
into a graph either by their spatial vicinity or their physi-
cal connections (e.g. routes). The main idea is to propagate
information along graph edges, so graph nodes can share in-
formation during the learning process. GNNs are relatively
generic, and can find nonlinear relationships between the in-
puts, hidden layers, and neighborhood information of each
node. By design, GNNs are more flexible than kriging.
However, kriging has an advantage over GNNs: kriging
directly uses observed training labels to predict the label of
a new data point. In comparison, there is no straightforward
way to feed training labels as input to a GNN. It is not fea-
sible to directly feed training labels as part of the input be-
cause the GNN will directly output the given label of a train-
ing data point and learn nothing. Furthermore, spatial data
modeling requires inductive learning – the model needs to
be able to make predictions for new locations that are not
in the graph formed by training data. While kriging is in-
trinsically inductive, only a few GNNs such as GraphSAGE
(Hamilton, Ying, and Leskovec 2017) can work inductively.
Inspired by these two observations, we develop a new
model, the Kriging Convolutional Network (KCN), as an
improvement to GNNs. The KCN is still a type of GNN.
However, it does not form a single large graph over all data
points. Every time a KCN fits the label of a data point (call
it the center), it forms a small graph over the center and its
neighboring training data points. These neighbors are the K
nearest neighbors according to a distance metric. In the input
to the KCN, we hide the label of the center node. The input
consists of feature vectors for all nodes in the graph (K + 1
nodes), as well as the labels of the neighbors. The KCN also
needs the adjacency matrix of the graph, which is defined to
be the spatial kernel matrix or a normalized version of that.
The target value of the KCN is the label of the center node.
We iterate over all of the training data, treating each node as
the center to train the KCN model. The KCN uses the same
structure to predict the label of a new data point.
The KCN combines the best parts of both models. In com-
parison to the GNN, it is able to directly leverage training la-
bels in prediction, and no re-training is necessary when new
data points are introduced. In contrast to kriging, the KCN
is more versatile. On a large dataset where overfitting is not
an issue, the KCN has a clear advantage over kriging. Even
though the KCN’s underlying mechanisms are very different
from kriging, our theoretical analysis reveals a deep connec-
tion between the two models. In fact, with a special config-
uration, the KCN can emulate kriging.
In summary, this work has three contributions:
• the development of the KCN, which is a GNN that di-
rectly uses training labels for prediction;
• the theoretical result showing that the KCN approxi-
mately recovers local universal kriging; and
• empirical studies indicating the KCN’s advantage over
baseline models.
Related Work
Kriging (Cressie 1991) has been widely used in spatial data
modeling. Using Kriging to model non-Gaussian data is of-
ten accomplished through careful transformation of labels
(Saito and Goovaerts 2000). However, it is not always fea-
sible to transform a variable to be Gaussian (Dance 2018).
One direction of exploration is to weaken the Gaussian as-
sumption of kriging models (Wallin and Bolin 2015), but
these methods are often specially designed for their respec-
tive applications.
GNNs are neural networks that work on graph data (Gori,
Monfardini, and Scarselli 2005). Wu et al. (2019) and Zhou
et al. (2018) have done extensive surveys of this topic. A
GNN typically consists of a few layers, each of which has
a non-linear transformation of the hidden vectors and a step
of information propagation between nodes. GNN architec-
tures differ by how they propagate information among graph
nodes (Kipf and Welling 2016; Atwood and Towsley 2016;
Hamilton, Ying, and Leskovec 2017; Veličković et al. 2018).
When a GNN is applied to spatial data (Wu et al. 2019;
Yan et al. 2019), one first builds a graph over data points
in the spatial area and then runs the GNN on the graph. To
the best of our knowledge, all of these methods feed features
as the input and fit labels by the output of the network. In
this work, we develop our KCN model based on the Graph
Convolutional Network (GCN) (Kipf and Welling 2016) and
Graph Attention Network (GAT) (Veličković et al. 2018).
Background
Suppose there are $N$ spatial data points, $(\mathbf{s}, X, \mathbf{y}) = (s_i, \mathbf{x}_i, y_i)_{i=1}^{N}$, where $s_i$, $\mathbf{x}_i$, and $y_i$ are respectively the location, the feature vector, and the label of data point $i$. Usually a location $s_i$ is a GPS coordinate, $s_i \in \mathbb{R}^2$. There are $d$ features in a feature vector $\mathbf{x}_i \in \mathbb{R}^d$. The domain of the target value $y_i$ is application-dependent. For example, $y_i \in \mathbb{N}$ when $y_i$ is a count, and $y_i \in \mathbb{R}_+$ when $y_i$ represents the precipitation level. One important task of spatial data modeling is to predict or estimate the value $y_*$ for a new location $s_*$ with a feature vector $\mathbf{x}_*$. Let $\hat{y}_*$ denote the prediction.
Kriging
There are many variants of kriging, of which universal krig-
ing is the most appropriate for the setting above. Universal
kriging has the following model assumption (Eq. 3.4.2 in
(Cressie 1991)).
$$y_i = \boldsymbol{\beta}^\top \mathbf{x}_i + \epsilon(s_i), \quad i = 1, \ldots, n, *. \quad (1)$$
Here $\boldsymbol{\beta}$ is the coefficient vector, and $\epsilon(\cdot)$ is a zero-mean random process with variogram $2\gamma(\cdot)$. The variogram $2\gamma(\cdot)$, which specifies the spatial correlation between data points, is a function of spatial distance: $2\gamma(\|s_i - s_j\|) = \mathrm{E}\left[(\epsilon(s_i) - \epsilon(s_j))^2\right]$. The variogram often takes a special functional form, with its parameters estimated from the data. With this model assumption, kriging minimizes the expected squared error, $\mathrm{E}_{y_*}\left[(\hat{y}_* - y_*)^2\right]$, in closed form. The prediction of universal kriging is then $\hat{y}^{\mathrm{kriging}}_* = \boldsymbol{\lambda}^\top(\mathbf{y} - \boldsymbol{\mu})$ with
$$\boldsymbol{\lambda} = \Gamma^{-1}\left(\boldsymbol{\gamma} - B X^\top \Gamma^{-1}\boldsymbol{\gamma} + B\mathbf{x}_*\right), \quad (2)$$
where $B = X(X^\top \Gamma^{-1} X)^{-1}$, $\Gamma = [\gamma(\|s_i - s_j\|)]_{i,j=1}^{n}$, and $\boldsymbol{\gamma} = [\gamma(\|s_i - s_*\|)]_{i=1}^{n}$.
Note that kriging uses known training labels as well as
all features as the input to make the prediction. Despite its
complex form, kriging has a subtle relation with the KCN
model proposed later.
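For concreteness, the weights in Eq. (2) can be computed with a few matrix operations. The sketch below is ours and assumes the semivariogram values Γ (n×n) and γ (n,) have already been evaluated for the training points and the new location:

import numpy as np

def universal_kriging_weights(Gamma, gamma_vec, X, x_star):
    # lambda = Gamma^{-1} (gamma - B X^T Gamma^{-1} gamma + B x_star),
    # with B = X (X^T Gamma^{-1} X)^{-1}, as in Eq. (2).
    Gamma_inv = np.linalg.inv(Gamma)              # (n, n)
    B = X @ np.linalg.inv(X.T @ Gamma_inv @ X)    # (n, d)
    return Gamma_inv @ (gamma_vec - B @ (X.T @ (Gamma_inv @ gamma_vec)) + B @ x_star)

The kriging prediction is then the corresponding linear combination of the observed training labels.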
Graph Convolutional Networks
Suppose we have a graph $G = (V, E)$, where $V = \{1, \ldots, M\}$ is the set of data points and $E$ is the edge set. Each data point $i \in V$ has a feature vector $\mathbf{x}_i$ and a label $y_i$. Later, we will collectively denote $(\tilde{X}, \tilde{\mathbf{y}})$ as a stack of all features and labels for notational convenience. Let $A$ denote the adjacency matrix of the graph and $\bar{A}$ denote the normalized adjacency matrix,
$$\bar{A} = D^{-\frac{1}{2}}(A + I)D^{-\frac{1}{2}}, \quad (3)$$
with $D = \mathrm{diag}(A\mathbf{1} + \mathbf{1})$ being the degree matrix plus one.
A GCN (Kipf and Welling 2016) takes $\bar{A}$ and $\tilde{X}$ as the input and fits the known labels in $\tilde{\mathbf{y}}$ as the target. The GCN consists of $L$ GCN layers. Each GCN layer $\ell$ takes an input $H_{\ell-1} \in \mathbb{R}^{n \times d_{\ell-1}}$ and outputs a matrix $H_{\ell} \in \mathbb{R}^{n \times d_{\ell}}$. The layer is parameterized by a matrix $W_{\ell}$ of size $d_{\ell-1} \times d_{\ell}$. Formally, the GCN is defined by
$$H_0 = \tilde{X}, \quad (4)$$
$$H_{\ell} = \sigma\left(\bar{A} H_{\ell-1} W_{\ell}\right), \quad \ell = 1, \ldots, L, \quad (5)$$
$$\hat{\mathbf{y}} = H_L. \quad (6)$$
Here $\sigma(\cdot)$ is a non-linear activation function.
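For concreteness, the following minimal PyTorch sketch implements Eqs. (3)-(6); the class name, the ReLU choice of σ, and the layer sizes are illustrative choices of ours, not the reference implementation:

import torch
import torch.nn as nn

def normalize_adjacency(a):
    # A_bar = D^{-1/2} (A + I) D^{-1/2} with D = diag(A 1 + 1), as in Eq. (3).
    a_hat = a + torch.eye(a.shape[0])
    d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a_hat @ d_inv_sqrt

class SimpleGCN(nn.Module):
    # H_0 = X; H_l = sigma(A_bar H_{l-1} W_l) for l = 1..L; y_hat = H_L (Eqs. 4-6).
    def __init__(self, dims):
        # dims = [d_0, d_1, ..., d_L]; the last entry is the output width.
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Linear(d_in, d_out, bias=False) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )

    def forward(self, a_bar, x):
        h = x                                   # Eq. (4)
        for linear in self.layers:
            h = torch.relu(a_bar @ linear(h))   # Eq. (5), with sigma taken as ReLU
        return h                                # Eq. (6)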

The GCN considers a semi-supervised task, in which only part of the labels $\tilde{\mathbf{y}}$ are observed. The GCN then defines its training loss based on the known labels and aims to predict the unknown labels. To apply a GCN to the task above, we form a graph over $(\mathbf{s}, s_*)$ and attach all feature vectors to the graph nodes. To predict a scalar for a graph node, the last layer $H_L$ has only $d_L = 1$ column, and its entry corresponding to the new data point $(s_*, \mathbf{x}_*)$ is the prediction $\hat{y}_*$. In practice, a two-layer GCN with $L = 2$ is often sufficient.
Kriging Convolutional Network
In this work, we develop a new learning model that directly uses training labels as input for prediction. We call this model a Kriging Convolutional Network (KCN).
We will first demonstrate how a KCN will be used for pre-
diction. Let's treat a KCN model as a function $\mathrm{KCN}(\cdot; \theta)$ parameterized by $\theta$. When predicting the label of a new data point $(s_*, \mathbf{x}_*)$, the model ideally should use all information available to make the prediction, that is, $\hat{y}_* = \mathrm{KCN}(\mathbf{s}, X, \mathbf{y}, s_*, \mathbf{x}_*)$. However, it is not feasible to consider all of the training points for just one prediction. It is not necessary either, because data points far from $s_*$ often have little influence over $y_*$ in many spatial problems. Therefore, we use the K nearest neighbors of the new data point as the input. Denote the index set of these neighbors as $\alpha_* \subset \{1, \ldots, N\}$; then the predictive function becomes
$$\hat{y}_* = \mathrm{KCN}(\mathbf{s}_{\alpha_*}, X_{\alpha_*}, \mathbf{y}_{\alpha_*}, s_*, \mathbf{x}_*). \quad (7)$$
To train the model, we treat every training point i as a test
point and fit its training label yi. The model’s output, ˆyi, is
compared against the true label. The difference of the two
is measured by some loss function loss(yi, ˆyi). The learning
objective of the model is to minimize the summation of all
training losses
$$\min_{\theta} \sum_{i=1}^{N} \mathrm{loss}(y_i, \hat{y}_i), \qquad \hat{y}_i = \mathrm{KCN}(\mathbf{s}_{\alpha_i}, X_{\alpha_i}, \mathbf{y}_{\alpha_i}, s_i, \mathbf{x}_i). \quad (8)$$
Here αi is the set of neighbors of i in the training set.
Now we construct the network architecture of the KCN,
i.e., the function $\mathrm{KCN}(\mathbf{s}_{\alpha_i}, X_{\alpha_i}, \mathbf{y}_{\alpha_i}, s_i, \mathbf{x}_i)$. Instead of using the locations $s_i$ and $\mathbf{s}_{\alpha_i}$ as features, we define a complete graph over the data point $i$ and its neighbors and then use a GCN to construct the predictive model. Denote $\beta_i = \{i\} \cup \alpha_i$ as the set containing the data point $i$ and its neighbors. We first define a graph over $\beta_i$ by constructing its adjacency matrix $A$ from a Gaussian kernel,
$$A_{jk} = \exp\left(-\frac{1}{2\phi^2}\|s_j - s_k\|_2^2\right), \quad \forall j, k \in \beta_i. \quad (9)$$
Algorithm 1: The training algorithm of KCN
Input: (s, X, y), K
Output: θ = (W_1, ..., W_L, w_den)
for i ← 0 to N do
    β_i = data point i together with the K nearest neighbors of s_i;
    Compute A from s_{β_i} by (9);
    Prepare H_0 from (x_i, X_{α_i}, y_{α_i}) by (10);
end
for iter ← 0 to num_training_iter do
    i = iter % N;
    H_L = GCN(A, H_0; W_1, ..., W_L);
    ŷ_i = σ(e^T H_L w_den);
    Compute loss(y_i, ŷ_i) and its derivative;
    Update the weights θ = (W_1, ..., W_L, w_den);
end
Here ϕ is the kernel length, which is a hyperparameter. In
this graph, the edge (j, k) has a large weight when j and k
are near each other and vice versa.
Next we define the feature input to the GCN. The input
should include features, xi and Xαi, and neighboring labels
yαi. Incorporating this information into a matrix will require
a bit of care. We place yαi and a zero in place of yi into a
vector with length (K +1), so the model has no access to yi.
We also use an indicator vector e to indicate that the instance
i is the one to be predicted. Then the GCN input is expressed
by a matrix $H_0$ of size $(K + 1) \times (2 + d)$:
$$H_0 = \begin{bmatrix} 0 & 1 & \mathbf{x}_i^\top \\ \mathbf{y}_{\alpha_i} & \mathbf{0} & X_{\alpha_i} \end{bmatrix}. \quad (10)$$
The locations sβi can be included in the feature matrix X as
features if there is reason to suspect spatial trends.
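To make this construction concrete, the following PyTorch sketch (ours, not the released code) builds the kernel adjacency matrix of Eq. (9) and the input matrix H_0 of Eq. (10) for one center point and its K neighbors:

import torch

def build_graph_input(s_center, x_center, s_neighbors, x_neighbors, y_neighbors, phi):
    # s_center: (2,), x_center: (d,), s_neighbors: (K, 2),
    # x_neighbors: (K, d), y_neighbors: (K,), phi: kernel length.
    s_all = torch.cat([s_center.unsqueeze(0), s_neighbors], dim=0)    # (K+1, 2)
    sq_dists = torch.cdist(s_all, s_all).pow(2)                       # pairwise squared distances
    A = torch.exp(-sq_dists / (2.0 * phi ** 2))                       # Eq. (9)

    # First row: hidden label (0), indicator 1, the center's features.
    center_row = torch.cat([torch.zeros(1), torch.ones(1), x_center])
    # Remaining K rows: neighbor labels, indicator 0, neighbor features.
    neighbor_rows = torch.cat(
        [y_neighbors.unsqueeze(1), torch.zeros(len(y_neighbors), 1), x_neighbors], dim=1
    )
    H0 = torch.cat([center_row.unsqueeze(0), neighbor_rows], dim=0)   # (K+1, 2+d), Eq. (10)
    return A, H0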
Then the KCN model is defined to be a GCN followed by
a dense layer. The KCN is formally defined as
$$H_L = \mathrm{GCN}(A, H_0), \quad (11)$$
$$\hat{y}_i = \sigma\left(\mathbf{e}^\top H_L \mathbf{w}_{\mathrm{den}}\right). \quad (12)$$
Here A and H0 are the adjacency matrix and the input fea-
ture matrix constructed from the neighborhood of i. Note
that every data point i gets its own A and H0, whose in-
dex i is omitted for notational simplicity. The vector e is the
indicator vector for i: it takes the first vector of HL, cor-
responding to i, as the input to the dense layer. The dense
layer allows for a final transformation of the data without
interference from neighbors.
The KCN parameters are all weight matrices, θ
=
{W1, . . . , WL, wden}. We train the KCN model by min-
imizing the loss in (8). Then we can predict the label of a
new data point using its features and neighbors in the train-
ing set. Algorithm 1 summarizes the training procedure of
the KCN.
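As an illustration of the loop in Algorithm 1, a PyTorch training sketch could look as follows; here model is assumed to map (A, H_0) to the center prediction as in Eqs. (11)-(12), and the squared error is used only as an example loss:

import torch

def train_kcn(model, data, num_iters, lr=1e-3):
    # data: a list of precomputed (A, H0, y_i) tuples, one per training point.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for it in range(num_iters):
        A, H0, y_i = data[it % len(data)]      # cycle over the training points
        y_hat = model(A, H0)                   # GCN forward pass plus the dense layer
        loss = (y_hat - y_i).pow(2).mean()     # example loss; Algorithm 1 leaves loss(., .) generic
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()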
Local kriging also makes predictions from only the nearest neighbors, so the KCN uses the same input as local kriging. However, the KCN is much more flexible. When the training set is large
enough such that the overfitting issue is less of a concern,
the KCN model has clear advantages.

Compared to the direct application of a GCN on spatial
data, a KCN is able to use labels from neighbors directly.
Furthermore, a KCN does not need to use the test data points
to form the graph. Therefore, it does not need to re-train the
model when there is a new batch of test data points.
The KCN is also similar to the KNN classifier but is much
more powerful: while the KNN simply averages the labels of
neighbors, the KCN uses a neural network as the predictive
function.
KCN with Graph Attention
The recent success of attention mechanisms on GNNs inspires us to try the Graph Attention Network (GAT) (Veličković et al. 2018) as the predictive model. The original GAT model computes attention weights with a neural network; it also requires that the attention weights of a node's neighbors sum to 1. Here we use dot-product self-attention (Vaswani et al. 2017) so that the model has the option to fall back on the GCN model.
Suppose the input feature at the ℓ-th layer of the GCN is
$H_{\ell-1}$; then we compute an attention matrix $U$ by
$$P = H_{\ell-1} W_{\mathrm{att}}, \quad M = \sigma(P P^\top), \quad \Lambda = \mathrm{diag}(M), \quad U = \Lambda^{-\frac{1}{2}} M \Lambda^{-\frac{1}{2}}. \quad (13)$$
Here Watt is the weight matrix for the attention mechanism.
It projects input features into a new space. Then the attention
weights are decided by inner products between features in
this new space. We normalize the attention matrix so that
the diagonal elements of U are always one.
In each layer $\ell$, we get an attention matrix $U_{\ell}$ as above. Then we use $A^{\mathrm{att}}_{\ell} = A \odot U_{\ell}$ as the new adjacency matrix used in layer $\ell$. The actual computation is
$$H_{\ell} = \sigma\left(A^{\mathrm{att}}_{\ell} H_{\ell-1} W_{\ell}\right), \quad \ell = 1, \ldots, L. \quad (14)$$
We call this new model the KCN-att. When the matrix
Watt has small weights, then U approaches a matrix with
all entries being one. In this case, the KCN-att becomes sim-
ilar to the KCN. When the matrix Watt has large weights,
then U tends to approach the identity matrix, and then the
KCN-att tends to reduce neighbors’ influence.
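For illustration, the attention adjacency of Eqs. (13)-(14) can be sketched as below; σ is taken to be a sigmoid here as an example, and W_att is the learned projection matrix:

import torch

def attention_adjacency(H_prev, W_att, A):
    # Compute U from Eq. (13) and the attended adjacency A_att = A ⊙ U from Eq. (14).
    P = H_prev @ W_att                          # project features into the attention space
    M = torch.sigmoid(P @ P.T)                  # pairwise similarities; sigma taken as a sigmoid
    lam = torch.diagonal(M)                     # Lambda = diag(M)
    U = M / torch.sqrt(lam.unsqueeze(1) * lam.unsqueeze(0))  # Lambda^{-1/2} M Lambda^{-1/2}
    return A * U                                # element-wise product, Eq. (14)

Note that U_ii = 1 by construction, so the diagonal of U is always one, as stated above.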
KCN Based on GraphSAGE
We also use GraphSAGE (Hamilton, Ying, and Leskovec
2017) as the predictive model of the KCN given that Graph-
SAGE performs well on several node classification tasks.
GraphSAGE cannot use a weighted graph, so we treat the
graph over the neighborhood of i as a complete graph. Let
$H_{\ell-1} = \{\mathbf{h}^{\ell-1}_k : k \in \beta_i\}$ be the input to GraphSAGE layer $\ell$; the layer computes its output $H_{\ell}$ as follows:
$$\mathbf{g}^{\ell}_j = \mathrm{AGG}\left(\{\mathbf{h}^{\ell-1}_k : k \in \beta_i, k \neq j\}\right), \quad \forall j \in \beta_i, \quad (15)$$
$$\mathbf{h}^{\ell}_j = \sigma\left(W^{\ell}_1 \mathbf{h}^{\ell-1}_j + W^{\ell}_2 \mathbf{g}^{\ell}_j\right), \quad \forall j \in \beta_i, \quad (16)$$
$$H_{\ell} = \left\{\mathbf{h}^{\ell}_j / \|\mathbf{h}^{\ell}_j\|_2 : \forall j \in \beta_i\right\}. \quad (17)$$
The function $\mathrm{AGG}(\cdot)$ aggregates a list of vectors into one. We use the max-pooling aggregator, one of the three aggregators proposed in the original work (Hamilton, Ying, and Leskovec 2017):
$$\mathrm{AGG}\left(H^{\ell-1}_{\beta_i \setminus j}\right) = \max\left(\sigma(W_{\mathrm{pool}} \mathbf{h}^{\ell-1}_k + \mathbf{b}),\ k \in \beta_i, k \neq j\right).$$
Here $\max$ takes the element-wise maximum over a list of vectors. We refer to this model as the KCN-sage.
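For illustration, a max-pooling GraphSAGE layer in the spirit of Eqs. (15)-(17), over the complete graph on one neighborhood, could be sketched as follows (a simplified version of ours, not the Spektral or original implementation):

import torch
import torch.nn as nn

class MaxPoolSageLayer(nn.Module):
    # One GraphSAGE layer with a max-pooling aggregator over the K+1 nodes of one neighborhood.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_self = nn.Linear(d_in, d_out, bias=False)   # W_1 in Eq. (16)
        self.w_neigh = nn.Linear(d_in, d_out, bias=False)  # W_2 in Eq. (16)
        self.w_pool = nn.Linear(d_in, d_in)                # W_pool and b in the aggregator

    def forward(self, H):
        n = H.shape[0]
        pooled = torch.relu(self.w_pool(H))                # sigma(W_pool h_k + b) for every node
        g_rows = []
        for j in range(n):
            mask = torch.arange(n) != j                    # aggregate over all k != j
            g_rows.append(pooled[mask].max(dim=0).values)  # element-wise max, Eq. (15)
        g = torch.stack(g_rows)
        h_new = torch.relu(self.w_self(H) + self.w_neigh(g))              # Eq. (16)
        return h_new / h_new.norm(dim=1, keepdim=True).clamp_min(1e-12)   # Eq. (17)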
Analysis
Computation Complexity
The time complexity of KCN and the two variants includes
nearest-neighbor search and network training. In order to
find the K nearest neighbors, we utilize a KD-tree, which takes $O(N \log N)$ time to build. Here we treat the dimensionality of the spatial coordinates as a constant because it is usually a small number (2 or 3). Querying a single data point in the tree takes $O(K \log N)$ time, and searching neighbors for all data points takes a total of $O(NK \log N)$ time.
When we train the model on a single instance, the computation of the adjacency matrix takes $O(K^2)$ time. The computation within each layer takes $O(K^2 d_{\max})$ time, with $d_{\max}$ being the largest dimensionality of the hidden layers. The forward computation and backpropagation for one instance take $O(K^2 L d_{\max})$ time, and one training epoch takes $O(NK^2 L d_{\max})$ time.
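The neighbor search described above can be carried out with a standard KD-tree; the snippet below uses scipy's cKDTree only to illustrate the O(N log N) build and O(K log N) queries:

from scipy.spatial import cKDTree

def nearest_neighbor_indices(locations, k):
    # locations: (N, 2) array of coordinates; returns the k nearest other points of each point.
    tree = cKDTree(locations)                   # build once: O(N log N)
    # Query k+1 neighbors because the nearest neighbor of a point is the point itself.
    _, idx = tree.query(locations, k=k + 1)     # each query: O(k log N)
    return idx[:, 1:]                           # drop the self index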
Relation to Kriging
KCN is a flexible model and approximately includes local
kriging as a special case. This fact is shown by the following
theorem.
Theorem 1: Assume the variogram of a kriging model satisfies $2\gamma(0) > 0$ (the value $2\gamma(0)$ is called the nugget of the variogram, which is usually greater than zero). Also assume $\tilde{X} = [\mathbf{x}_*, X_{\alpha_*}^\top]^\top$ has full column rank. Then there exists a set of special parameters and activations with which a KCN makes the same prediction as kriging, i.e., $\hat{y}^{\mathrm{KCN}}_* = \hat{y}^{\mathrm{kriging}}_*$.
Proof sketch: Let $\tilde{\Gamma}$ be the covariance matrix corresponding to the new data point and the training data points,
$$\tilde{\Gamma} = \begin{bmatrix} 0 & \boldsymbol{\gamma}^\top \\ \boldsymbol{\gamma} & \Gamma \end{bmatrix}. \quad (18)$$
Here $\boldsymbol{\gamma}$ and $\Gamma$ are semivariograms defined in the same way as in kriging. To approximate kriging, we set the KCN to have one convolutional layer and a dense layer. We set
$$\bar{A} = \tilde{\Gamma}^{-1} - \tilde{\Gamma}^{-1} \tilde{X} (\tilde{X}^\top \tilde{\Gamma}^{-1} \tilde{X})^{-1} \tilde{X}^\top \tilde{\Gamma}^{-1} \quad (19)$$
as the "normalized adjacency matrix" and directly use it to multiply the hidden input. We consider a one-layer GCN with a special activation function $\sigma_{\mathrm{div}}(\cdot)$. The first row of the GCN output is $\mathbf{e}^\top H_L = \sigma_{\mathrm{div}}(\mathbf{e}^\top \bar{A} H_0 W_1)$. In the Appendix we show $\mathbf{e}^\top \bar{A} = \left[z^{-1}, -z^{-1}\boldsymbol{\lambda}^\top\right]$; then
$$\mathbf{e}^\top \bar{A} H_0 = \left[-z^{-1}\boldsymbol{\lambda}^\top \mathbf{y}_{\alpha_*},\ z^{-1},\ z^{-1}(\mathbf{x}_*^\top - \boldsymbol{\lambda}^\top X_{\alpha_*})\right].$$
Here $z$ is a scalar, and $\boldsymbol{\lambda}$ is the kriging coefficient defined in (2). Let $W_1$ be the matrix taking the first two elements of this vector, so that $\mathbf{e}^\top \bar{A} H_0 W_1 = [-z^{-1}\boldsymbol{\lambda}^\top \mathbf{y}_{\alpha_*}, z^{-1}]$. Denote it as $\mathbf{u}$. Define the activation function to be $\sigma_{\mathrm{div}}(\mathbf{u}) = [-u_1/u_2, 0]$, set $\mathbf{w}_{\mathrm{den}} = [1, 0]^\top$, and set the activation of the fully connected layer to be the identity; then the KCN predicts $\hat{y}^{\mathrm{KCN}}_* = \boldsymbol{\lambda}^\top \mathbf{y}_{\alpha_*}$, which is exactly the kriging prediction $\hat{y}^{\mathrm{kriging}}_*$.
In the actual implementation, we use normal activation functions such as ReLU. The combination of the first two rows of $W_1$, the GCN activation, and the dense layer can be viewed as a two-layer feedforward network applied to $\mathbf{u}$. If this two-layer neural network can emulate the function $-u_1/u_2$, then a normal setting of the network can also approximate kriging well. □
This theorem and its proof have strong implications for
our model development. First, if the KCN uses the $\bar{A}$ defined above as the normalized Laplacian, then the KCN has a straightforward way to discover kriging solutions. Since $\bar{A}$ has a small size, $(K + 1) \times (K + 1)$, the computation of $\bar{A}$ is affordable. Second, the matrix $\bar{A}$ indicates that we
should introduce the feature matrix into the computation of
the “normalized Laplacian”. Otherwise, the KCN may need
complicated computations to recover kriging results. This is
one main motivation behind our usage of graph attention in
KCN-att.
Experiment
We evaluate our methods on three tasks: bird count mod-
eling, restaurant rating regression, and precipitation regres-
sion. We use Kriging, Random Forest, the Graph Convolutional Network, and GraphSAGE (Graph SAmple and aggreGatE) as baselines.
Experiment setup
Kriging: we use the implementation of Kriging within Au-
tomap (Hiemstra et al. 2008). Automap essentially auto-
mates the process of Kriging, by automatically fitting var-
iograms, and testing several different models. In all of our
experiments, Automap tests spherical, exponential, Gaussian, Matérn, and Stein variograms and picks the best one based on the smallest residual sum of squares. Since all of the datasets
have a large number of data points, we use local kriging and
only consider the closest 100 points.
Random Forest: Hengl et al. (2018) use Random Forest to
make predictions for spatial data. For each data point, the
algorithm calculates the distances between that point and all
training points. These distances are then used as the feature
vector of that data point. This algorithm does not scale to
very large datasets, so we downsample the training set to a
size of 1000. We use the Random Forest implementation of Wright and Ziegler (2017) and the tuning method of Probst, Wright, and Boulesteix (2018), as used by the authors of Hengl et al. (2018). The implementation tunes four hyperparame-
ters of Random Forest: the number of trees to use, the num-
ber of variables to consider at a node split, the minimal node
size, and the sample fraction when training each tree.
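For reference, the distance-feature construction of Hengl et al. (2018) can be sketched in Python as follows; the authors use the ranger R package, so this sklearn version is only an illustration of the idea:

from scipy.spatial.distance import cdist
from sklearn.ensemble import RandomForestRegressor

def fit_distance_feature_rf(train_locations, train_labels, **rf_kwargs):
    # Each (downsampled) training point is described by its distances to all training points.
    features = cdist(train_locations, train_locations)
    model = RandomForestRegressor(**rf_kwargs)
    model.fit(features, train_labels)
    return model

def predict_distance_feature_rf(model, train_locations, test_locations):
    # Test points use their distances to the same training points as features.
    return model.predict(cdist(test_locations, train_locations))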
GCN: we modify Kipf's implementation of the GCN (Kipf and Welling 2016) for regression problems. Before we run the GCN on spatial data, we first build an undirected graph over the data points: we connect two data points if one is among the other's K nearest neighbors. We only consider a GCN with two hidden layers. We tune the hyper-parameters of the GCN in the same way as we tune the KCN and the KCN-att below.

Figure 1: Wood thrush (a) and observed counts over the eastern US, June 2014 (b).
GraphSAGE: we implement GraphSAGE with the Spektral
graph deep learning library. For each experiment, we build
an undirected graph in the same way as the GCN. Then we
train a GraphSAGE model with two hidden layers, with hyperparameters tuned as described below.
KCN & KCN-att & KCN-sage: the three models use two
hidden layers respectively. We tune the following hyperpa-
rameters: hidden sizes ∈((20, 10), (10, 5), (5, 3)), dropout
rate ∈(0, 0.25, 0.5), and kernel length ∈(1, .5, .1, .05).
Note that GraphSAGE and KCN-sage do not consider
weighted adjacency matrix, so there is no need to tune kernel
length for them. We also employed early stopping to decide
the number epochs.
Bird count modeling
One application of our KCN models is modeling bird count
data from the eBird project (Fink et al. 2010), which contains over one billion records of bird observation events. Modeling bird data from the eBird project provides an opportunity to deepen our understanding of birds as part of the ecosystem. In this experiment, we model the distribution of the wood thrush in June, which is of great interest to ornithologists (Johnston et al. 2019). Figure 1 shows a picture of a
wood thrush and the distribution of observed counts over the
eastern US.
We restrict our data to a subset of records of wood thrush
in June 2014. Each record has a GPS location, a count
of wood thrushes observed, and a list of features such as
observation time, count type (stationary count, traveling
count, etc.), effort hours, and effort area. After removing
583 records with uncertain counts or counts over 10, we
get 107,246 records to form our dataset. Bird counts in this
dataset are highly sparse: only 11,468 records (fraction of
0.11) have positive counts. We split the dataset into a train-
ing set and a test set by 1:1.
When we test our models and baselines, we consider
two evaluation metrics. The first one is mean squared error
(MSE), so we have a fair comparison with Kriging, the min-
imization objective of which is the mean squared error. The
second one is negative log-likelihood. We use a zero-inflated
Poisson distribution (Lambert 1992) as the predictive distribution for each count. The model needs to output a logit $u$ for the Bernoulli probability and the mean $\lambda$ of the Poisson component. The probability of a count $y$ given $u$ and $\lambda$ is
$$p(y) = \begin{cases} \left(1 - \mathrm{expit}(u)\right) + \mathrm{expit}(u)\, p_{\mathrm{Poisson}}(y = 0) & \text{if } y = 0, \\ \mathrm{expit}(u)\, p_{\mathrm{Poisson}}(y) & \text{if } y > 0. \end{cases} \quad (20)$$

Table 1: Experiment results on the bird count dataset. Performance is measured by the mean squared error (MSE) and the negative log-likelihood (NLL) of predictions. Smaller values are better.

Method      MSE          NLL
Kriging     1.56 ± .85   n.a.
RF          0.68 ± .03   n.a.
GCN         0.70 ± .02   1.82 ± .00
GraphSAGE   0.53 ± .01   1.73 ± .00
KCN         0.50 ± .01   1.60 ± .00
KCN-att     0.49 ± .01   1.58 ± .00
KCN-sage    0.44 ± .01   1.51 ± .00

Figure 2: Mean squared error of GCN, GraphSAGE, KCN, KCN-att, and KCN-sage using different numbers of neighbors.
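As an illustration of evaluating the negative log-likelihood under the zero-inflated Poisson of Eq. (20), a small PyTorch helper might look like this sketch (ours; expit is implemented as a sigmoid):

import torch

def zip_negative_log_likelihood(y, u, lam):
    # y: observed counts; u: Bernoulli logits; lam: Poisson means (tensors of the same shape).
    pi = torch.sigmoid(u)                                         # expit(u): weight of the Poisson component
    log_pois = y * torch.log(lam) - lam - torch.lgamma(y + 1.0)   # log Poisson pmf
    log_p_zero = torch.log((1.0 - pi) + pi * torch.exp(-lam))     # y = 0 branch of Eq. (20)
    log_p_pos = torch.log(pi) + log_pois                          # y > 0 branch of Eq. (20)
    log_p = torch.where(y == 0, log_p_zero, log_p_pos)
    return -log_p.mean()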
Table 1 shows the performance of KCN, KCN-att, KCN-
sage, and baseline models. From this table, we can see
that the three KCN models significantly outperform baseline
methods. We also observe that the GraphSAGE-based methods perform better than the GCN-based methods; we speculate this is because of the concatenation operation (which plays a role similar to a skip connection) used in GraphSAGE. Given
that bird counts are highly non-Gaussian, we don’t expect
kriging to perform very well. Random Forest gets much bet-
ter performance than kriging, but it overly smooths the train-
ing data given the small number of training points it can use.
The KCN and the KCN-att achieve similar performances.
We also study the performances of the GCN, Graph-
SAGE, KCN, KCN-att, and KCN-sage when different num-
bers of neighbors are used to form the graph. Figure 2 shows
performance values of the five models using different num-
bers. The GCN performs poorly when the number of neigh-
bors is small in the construction of the graph. In this case,
a test point might only connect to another test data point,
then the message propagation between two test points is not
helpful. GraphSAGE is robust to the number of neighbors.
In the KCN models, a data point has K training points as
direct neighbors, so KCN models can make better use of
the training data in this sense. When a KCN model uses
zero neighbors, it is equivalent to a fully connected neu-
ral network, and its performance deteriorates significantly.
It indicates that spatial correlation exists in the data. KCN,
KCN-att, and KCN-sage only need a small number of neigh-
bors to perform well. We speculate that a bird or its nest can
be observed only in a small spatial range, so the correlation between nearby sites is strong but diminishes quickly as
the distance increases. The KCN-att performs slightly better
than the KCN because the KCN-att is able to use observational features to decide whether a neighboring count is from
the same situation or not.
Restaurant rating regression
Yelp is a popular rating website, which allows users to rate
and provide information about businesses. They have hosted
a large collection of these business ratings and attributes for
download. In this experiment, we only consider the restau-
rants within that dataset. Each restaurant has a GPS loca-
tion and an average rating rounded to the nearest .5, from 0
to 5. Additionally, we choose 13 related attributes from the
dataset, all but one of which is categorical. We turn these
categorical covariates into 30 indicator variables. These in-
dicators give information about restaurant attributes such as
whether it serves alcohol and whether it takes credit cards. After we drop any rows where the ratings, coordinates,
or number of reviews is NA, we obtain 188,586 restaurants.
We then split the data 1:1 to form a training and test set.
Table 2 shows the experiment results on this dataset. The
KCN, the KCN-att, and the KCN-sage improve the perfor-
mance of their corresponding baseline models. The KCN,
the KCN-att, the KCN-sage, and the GraphSAGE outper-
form baseline models by a small margin. The regression task
on this dataset is a hard one. The features do not seem to be very useful; it is actually hard to overfit the labels with a normal feedforward neural network. However, there is a weak spatial effect. The average rating over the entire dataset
achieves a mean squared error of 1.01 while the average of
the nearest 35 neighbors results in a mean squared error of
.984. This is understandable, since good restaurants and bad restaurants often mix within the same area. In this ex-
periment, we find that Kriging is not very stable when features are discrete and sparse. We add a small amount of noise to
the feature matrix to avoid numerical issues.
Precipitation regression
The National Oceanic and Atmospheric Administration
keeps detailed records of precipitation levels across the
United States. One such dataset provides monthly average
precipitation in inches from 1981 to 2010 across the US. We
average the precipitation level in May for 8,832 stations. We
then take the log of these average precipitation values as tar-
get values for the regression task. Essentially, we assume a log-normal distribution of precipitation levels. Finally, we have a target value, the coordinates of each station, and one feature (the elevation) of each station. The data are split 1:1 into a training set and a test set. Figure 3 shows the data
distribution over the US.

Table 2: The results on the dataset of restaurant ratings. Performances are measured by MSE. Smaller values are better.

Method         MSE
Kriging        1.49 ± .008
Random Forest  1.04 ± .005
GCN            1.37 ± .006
GraphSAGE      0.969 ± .004
KCN            0.990 ± .004
KCN-att        0.977 ± .004
KCN-sage       0.959 ± .004

Table 3: Experiment results on the precipitation dataset. Performances are measured by MSE. Smaller values are better.

Method         MSE
Kriging        .155 ± .013
Random Forest  .046 ± .003
GCN            .640 ± .023
GraphSAGE      .056 ± .003
KCN            .029 ± .002
KCN-att        .029 ± .002
KCN-sage       .030 ± .002

Figure 3: Distribution of precipitation values.
Table 3 summarizes the experimental results using the
mean squared error. The target values in the log-scale are
more likely to be from a Gaussian distribution than the pre-
vious two datasets, so Kriging performs relatively well com-
pared to other methods. The Random Forest method only
uses 1000 data points as the training data, so it omits a lot of
detailed variations. The GCN models perform poorly on this
dataset. One reason is that there are not many features for the
GCN to learn. The GCN model becomes more like a “gener-
ative” model that generates observations from hidden values.
Compared to the GCN, the KCN models particularly benefit
from this dataset because the KCN models work more like a
discriminative model. Note that discriminative models often
outperform generative models in supervised learning tasks.
Conclusion
In this work, we introduced the Kriging Convolutional Net-
work, a novel approach to modeling spatial data. Like krig-
ing, the KCN model directly use training labels in the pre-
diction. However, it enjoys the flexibility of neural networks
by using GNNs as the backbone. We further introduce the
attention mechanism to the model to create the KCN-att
model. The KCN-att model has better control over which
neighbors to use. Our analysis also reveals that KCN has a
straightforward method to approximate kriging models. In
the empirical study, we have compared KCN and KCN-att
with three baselines on three applications. The experiment
results shows the superiority of two KCN models over base-
lines. They indicates that feeding in observed labels to the
model is a powerful way to improve the performance.
Appendix: Detailed Proof of Theorem 1
We need to derive $\mathbf{u} = \mathbf{e}^\top \bar{A} \tilde{X}$, where $\mathbf{e} = [1, \mathbf{0}^\top]^\top$, $\bar{A}$ is the "normalized adjacency matrix" defined in (19), and $\tilde{X} = [\mathbf{x}_*, X^\top]^\top$ are the feature vectors. We create the following shorthand notations to facilitate our derivation:
$$t = -\boldsymbol{\gamma}^\top \Gamma^{-1} \boldsymbol{\gamma}, \quad \mathbf{a} = -\Gamma^{-1}\boldsymbol{\gamma}, \quad \mathbf{c} = \mathbf{x}_* + X^\top \mathbf{a}, \quad T = X^\top \Gamma^{-1} X, \quad B = X T^{-1}, \quad r = \mathbf{c}^\top T^{-1} \mathbf{c}.$$
We will show that the first row of $\bar{A}$ is
$$\mathbf{e}^\top \bar{A} = \left[z^{-1}, -z^{-1}\boldsymbol{\lambda}^\top\right], \quad z = -\boldsymbol{\gamma}^\top \Gamma^{-1} \boldsymbol{\gamma} + \boldsymbol{\gamma}^\top \Gamma^{-1} X T^{-1} X^\top \Gamma^{-1} \boldsymbol{\gamma}.$$
By checking (19), we first compute the inverse $\tilde{\Gamma}^{-1}$:
$$\tilde{\Gamma}^{-1} = \begin{bmatrix} t^{-1} & t^{-1}\mathbf{a}^\top \\ t^{-1}\mathbf{a} & \Gamma^{-1} + t^{-1}\mathbf{a}\mathbf{a}^\top \end{bmatrix} = t^{-1}\begin{bmatrix} 1 \\ \mathbf{a} \end{bmatrix}\begin{bmatrix} 1, \mathbf{a}^\top \end{bmatrix} + \begin{bmatrix} 0 & \mathbf{0}^\top \\ \mathbf{0} & \Gamma^{-1} \end{bmatrix}.$$
Denote $\mathbf{v}_1 = \mathbf{e}^\top \tilde{\Gamma}^{-1} = t^{-1}[1, \mathbf{a}^\top]$.
We then consider the second term in (19). We have
$$\tilde{X}^\top \tilde{\Gamma}^{-1} = t^{-1}\mathbf{c}[1, \mathbf{a}^\top] + [\mathbf{0}, X^\top \Gamma^{-1}] = [t^{-1}\mathbf{c},\ t^{-1}\mathbf{c}\mathbf{a}^\top + X^\top \Gamma^{-1}].$$
Denote $S = (\tilde{X}^\top \tilde{\Gamma}^{-1} \tilde{X})^{-1}$; then
$$S = (t^{-1}\mathbf{c}\mathbf{c}^\top + T)^{-1} = T^{-1} - T^{-1}\mathbf{c}(t + \mathbf{c}^\top T^{-1}\mathbf{c})^{-1}\mathbf{c}^\top T^{-1}.$$
The first equality follows from $[1, \mathbf{a}^\top]\tilde{X} = \mathbf{c}^\top$.
Denote $\mathbf{v}_2 = \mathbf{e}^\top \tilde{\Gamma}^{-1} \tilde{X} S \tilde{X}^\top \tilde{\Gamma}^{-1}$. Inserting the expansion of $\tilde{X}^\top \tilde{\Gamma}^{-1}$, we have
$$\mathbf{v}_2 = t^{-1}\mathbf{c}^\top S\,[t^{-1}\mathbf{c},\ t^{-1}\mathbf{c}\mathbf{a}^\top + X^\top \Gamma^{-1}].$$
By $\mathbf{c}^\top S = t(t + r)^{-1}\mathbf{c}^\top T^{-1}$ and $w = \mathbf{c}^\top S \mathbf{c} = rt(t + r)^{-1}$, we have
$$\mathbf{v}_2 = (t + r)^{-1}\left[rt^{-1},\ rt^{-1}\mathbf{a}^\top + \mathbf{c}^\top T^{-1} X^\top \Gamma^{-1}\right].$$
Since $t^{-1} - (t + r)^{-1}rt^{-1} = (t + r)^{-1}$, we have
$$\mathbf{e}^\top \bar{A} = \mathbf{v}_1 - \mathbf{v}_2 = (t + r)^{-1}\left[1,\ \mathbf{a}^\top - \mathbf{c}^\top T^{-1} X^\top \Gamma^{-1}\right].$$
Then we expand $\mathbf{a}$ and $\mathbf{c}$ to get
$$\mathbf{e}^\top \bar{A} = (t + r)^{-1}\left[1,\ -(\boldsymbol{\gamma} + B\mathbf{x}_* - B X^\top \Gamma^{-1}\boldsymbol{\gamma})^\top \Gamma^{-1}\right] = z^{-1}\left[1, -\boldsymbol{\lambda}^\top\right].$$
Here $z = t + r$. Since $\bar{A}$ is computed from normal matrix operations, its entries are bounded; therefore, $z \neq 0$.
Acknowledgments
This work has been supported in part by Gordon & Betty
Moore Foundation, NSF CISE-1908617, and NSF CRII-
1850358. We thank all reviewers of this work for their in-
sightful comments.
References
[Atwood and Towsley 2016] Atwood, J., and Towsley, D.
2016. Diffusion-convolutional neural networks. In Advances
in Neural Information Processing Systems, 1993–2001.
[Cressie 1991] Cressie, N. A. C. 1991. Statistics for spatial
data. Wiley series in probability and mathematical statistics.
Applied probability and statistics. New York: J. Wiley.
[Dance 2018] Dance, T. 2018. A comparison of linear and
non-linear kriging techniques for predicting the probability
of exceeding a threshold value. Ph.D. Dissertation.
[Fink et al. 2010] Fink, D.; Hochachka, W. M.; Zuckerberg,
B.; Winkler, D. W.; Shaby, B.; Munson, M. A.; Hooker, G.;
Riedewald, M.; Sheldon, D.; and Kelling, S. 2010. Spa-
tiotemporal exploratory models for broad-scale survey data.
Ecological Applications 20(8):2131–2147.
[Gao and Liu 2014] Gao, H., and Liu, H. 2014. Data analy-
sis on location-based social networks. In Mobile social net-
working. Springer. 165–194.
[Gori, Monfardini, and Scarselli 2005] Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learn-
ing in graph domains. In Proceedings. 2005 IEEE Inter-
national Joint Conference on Neural Networks, 2005., vol-
ume 2, 729–734. IEEE.
[Hamilton, Ying, and Leskovec 2017] Hamilton, W.; Ying,
Z.; and Leskovec, J. 2017. Inductive representation learning
on large graphs. In Advances in Neural Information Pro-
cessing Systems, 1024–1034.
[Hengl et al. 2018] Hengl, T.; Nussbaum, M.; Wright, M. N.;
Heuvelink, G. B.; and Gräler, B. 2018. Random forest as
a generic framework for predictive modeling of spatial and
spatio-temporal variables. PeerJ 6:e5518.
[Hiemstra et al. 2008] Hiemstra, P.; Pebesma, E.; Twenhöfel, C.; and Heuvelink, G. 2008. Real-time automatic interpolation of ambient gamma dose rates from the Dutch radioactivity monitoring network. Computers & Geosciences. DOI: http://dx.doi.org/10.1016/j.cageo.2008.10.011.
[Johnston et al. 2019] Johnston, A.; Hochachka, W.; Strimas-Mackey, M.; Gutierrez, V. R.; Robinson, O.; Miller,
E.; Auer, T.; Kelling, S.; and Fink, D. 2019. Best practices
for making reliable inferences from citizen science data:
case study using ebird to estimate species distributions.
bioRxiv 574392.
[Kipf and Welling 2016] Kipf, T. N., and Welling, M. 2016.
Semi-supervised classification with graph convolutional net-
works. arXiv preprint arXiv:1609.02907.
[Lambert 1992] Lambert, D. 1992. Zero-inflated poisson re-
gression, with an application to defects in manufacturing.
Technometrics 34(1):1–14.
[Li et al. 2017] Li, Y.; Yu, R.; Shahabi, C.; and Liu, Y. 2017.
Diffusion convolutional recurrent neural network: Data-
driven traffic forecasting. arXiv preprint arXiv:1707.01926.
[Probst, Wright, and Boulesteix 2018] Probst, P.; Wright, M.; and Boulesteix, A.-L. 2018. Hyperparameters and tun-
ing strategies for random forest. Wiley Interdisciplinary Re-
views: Data Mining and Knowledge Discovery.
[Saito and Goovaerts 2000] Saito, H., and Goovaerts, P.
2000. Geostatistical interpolation of positively skewed and
censored data in a dioxin-contaminated site. Environmental
Science & Technology 34(19):4228–4235.
[Vaswani et al. 2017] Vaswani, A.; Shazeer, N.; Parmar, N.;
Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and
Polosukhin, I. 2017. Attention is all you need. In Guyon, I.;
Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vish-
wanathan, S.; and Garnett, R., eds., Advances in Neural In-
formation Processing Systems 30. 5998–6008.
[Veličković et al. 2018] Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018.
Graph attention networks. In International Conference on
Learning Representations.
[Wallin and Bolin 2015] Wallin, J., and Bolin, D. 2015. Geo-
statistical modelling using non-Gaussian Matérn fields. Scan-
dinavian Journal of Statistics 42(3):872–890.
[Wright and Ziegler 2017] Wright, M. N., and Ziegler, A.
2017. ranger: A fast implementation of random forests for
high dimensional data in C++ and R. Journal of Statistical
Software 77(1):1–17.
[Wu et al. 2019] Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang,
C.; and Yu, P. S. 2019. A comprehensive survey on graph
neural networks. arXiv preprint arXiv:1901.00596.
[Xingjian et al. 2015] Xingjian, S.; Chen, Z.; Wang, H.; Ye-
ung, D.-Y.; Wong, W.-K.; and Woo, W.-c. 2015. Convolu-
tional lstm network: A machine learning approach for pre-
cipitation nowcasting. In Advances in neural information
processing systems, 802–810.
[Yan et al. 2019] Yan, X.; Ai, T.; Yang, M.; and Yin, H. 2019.
A graph convolutional neural network for classification of
building patterns using spatial vector data. ISPRS journal of
photogrammetry and remote sensing 150:259–273.
[Yu, Yin, and Zhu 2017] Yu, B.; Yin, H.; and Zhu, Z. 2017.
Spatio-temporal graph convolutional networks: A deep
learning framework for traffic forecasting. arXiv preprint
arXiv:1709.04875.
[Zhou et al. 2018] Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.;
Liu, Z.; and Sun, M. 2018. Graph neural networks: A review of methods and applications. arXiv preprint
arXiv:1812.08434.

[Zhu and Liu 2018] Zhu, D., and Liu, Y. 2018. Modelling
spatial patterns using graph convolutional networks (short
paper).
In 10th International Conference on Geographic
Information Science (GIScience 2018). Schloss Dagstuhl-
Leibniz-Zentrum fuer Informatik.
